Abstract

An important question when eliciting opinions 
from experts is how to aggregate the reported opin- 
ions. In this paper, we propose a pooling method 
to aggregate expert opinions. Intuitively, it works 
as if the experts were continuously updating their 
opinions in order to accommodate the expertise of 
others. Each updated opinion takes the form of a 
linear opinion pool, where the weight that an ex- 
pert assigns to a peer's opinion is inversely related 
to the distance between their opinions. In other 
words, experts are assumed to prefer opinions that 
are close to their own opinions. We prove that such 
an updating process leads to consensus, i.e., the ex- 
perts all converge towards the same opinion. Fur- 
ther, we show that if rational experts are rewarded 
using the quadratic scoring rule, then the assump- 
tion that they prefer opinions that are close to their 
own opinions follows naturally. We empirically 
demonstrate the efficacy of the proposed method 
using real-world data. 

1 Introduction 

Predicting outcomes of relevant uncertain events plays an es- 
sential role in decision-making processes. For example, com- 
panies rely on predictions about consumer demand and ma- 
terial supply to make their production plans, while weather 
forecasts provide guidelines for long range or seasonal agri- 
cultural planning, e.g., farmers can select crops that are best 
suited to the anticipated climatic conditions. 

Forecasting techniques can be roughly divided into statis- 
tical and non-statistical methods. Statistical methods require 
historical data that contain valuable information about the fu- 
ture event. When such data are not available, a widely used 
non-statistical method is to request opinions from experts
regarding the future event [Cooke, 1991]. Opinions usually
take the form of either numerical point estimates or prob- 
ability distributions over plausible outcomes. We focus on 
opinions as probability mass functions. 

The literature related to expert opinions is typically
concerned with how expert opinions are used
[Mosleh et al., 1988], how uncertainty is or should be represented
[Ng and Abramson, 1990], how experts do or should
reason with uncertainty [Cooke, 1991], how to score the
quality and usefulness of expert opinions [Savage, 1971;
Boutilier, 2012], and how to produce a single consensual
opinion when different experts report differing opinions
[DeGroot, 1974]. It is this last question that we address in
this paper.

We propose a pooling method to aggregate expert opinions 
that works as if the experts were continuously updating their 
opinions in order to accommodate the expertise and knowl- 
edge of others. Each updated opinion takes the form of a 
linear opinion pool, or a convex combination of opinions, 
where the weight that an expert assigns to a peer's opinion 
is inversely related to the distance between their opinions. In 
other words, experts are assumed to prefer opinions that are 
close to their own opinions. We prove that such an updat- 
ing process leads to consensus, i.e., the experts all converge 
towards the same opinion. We also show that if the opinions 
of rational experts are scored using the quadratic scoring rule, 
then the assumption that experts prefer opinions that are close 
to their own follows naturally. 

2 Related Work 

The aggregation of expert opinions has been extensively
studied in computer science and, in particular, artificial intelligence,
e.g., the aggregation of opinions represented as preferences
over a set of alternatives as in social choice theory
[Chevaleyre et al., 2007], the aggregation of point estimates
using non-standard opinion pools [Jurca and Faltings, 2008],
and the aggregation of probabilistic opinions using prediction
markets [Chen and Pennock, 2010].

A traditional way of aggregating probabilistic opinions 
is through opinion pooling methods. These methods are 
often divided into behavioral and mathematical methods 
[Clemen and Winkler, 1999]. Behavioral aggregation methods
attempt to generate agreement among the experts through
interactions in order for them to share and exchange knowl- 
edge. Ideally, such sharing of information leads to a consen- 
sus. However, these methods typically provide no conditions 
under which the experts can be expected to reach agreement 
or even for terminating the iterative process. 

On the other hand, mathematical aggregation methods consist
of processes or analytical models that operate on the individual
probability distributions in order to produce a single,
aggregate probability distribution. An important mathematical
method is the linear opinion pool, which involves taking
a weighted linear average of the opinions [Cooke, 1991].

Several interpretations have been offered for the weights 
in the linear opinion pool. The performance-based approach 
recommends setting the weights based on previous performance
of the experts [Genest and McConway, 1990]. A
caveat with this approach is that performance measurements 
typically depend on the true outcome of the underlying event, 
which might not be available at the time when the opinions 
have to be aggregated. Also, previous successful (respective- 
ly, unsuccessful) predictions are not necessarily good indica- 
tors of future successful (respectively, unsuccessful) ones. 

More closely related to this work is the interpretation of 
weights as a measure of distance. For example, Barlow et 
al. [1986] proposed that the weight assigned to each expert's
opinion should be inversely proportional to its distance to the 
most distant opinion, where distance is measured according 
to the Kullback-Leibler divergence. A clear drawback with 
this approach is that it only considers the distance to the most 
distant opinion when assigning a weight to an expert's opin- 
ion. Thus, even if the majority of experts have similar and 
accurate opinions, the weights of these experts' opinions in 
the aggregate prediction can be greatly reduced due to a sin- 
gle distant opinion. 

For a comprehensive review of different perspectives on 
the weights in the linear opinion pool, we refer the interested 
reader to the work by Genest and McConway [1990].

3 Model 

We consider the forecasting setting where a decision maker 
is interested in a probability vector over a set of mutually 
exclusive outcomes $\theta_1, \ldots, \theta_z$, for $z \geq 2$. The decision
maker deems it inappropriate to interject his own judgment
about these outcomes. Hence, he elicits probabilistic opinions
from $n$ experts. Experts' opinions are represented by
$z$-dimensional probability vectors $\mathbf{f}_1, \ldots, \mathbf{f}_n$. The probability
vector $\mathbf{f}_i = (f_{i,1}, \ldots, f_{i,z})$ represents expert $i$'s opinion,
where $f_{i,k}$ is his subjective probability regarding the occurrence
of outcome $\theta_k$.

Since experts are not always in agreement, belief aggregation
methods are used to combine their opinions into a single
probability vector. Formally, $\mathbf{f} = T(\mathbf{f}_1, \ldots, \mathbf{f}_n)$, where $\mathbf{f}$ is
called an opinion pool, and the function $T$ is the pooling operator.
The linear opinion pool is a standard approach that
involves taking a weighted linear average of the opinions:

$$T(\mathbf{f}_1, \ldots, \mathbf{f}_n) = \sum_{i=1}^{n} w_i \mathbf{f}_i \tag{1}$$

where $w_i$ denotes the weight associated with expert $i$'s opinion.
We make the standard assumption that $0 \leq w_i \leq 1$, for
every $i \in \{1, \ldots, n\}$, and $\sum_{i=1}^{n} w_i = 1$.

3.1 Consensus and Weights

DeGroot [1974] proposed a model which describes how a
group can reach agreement on a common probability distribution
by pooling their individual opinions. Initially, each
expert $i$ is informed of the opinion of every other expert. In
order to accommodate the information and expertise of the
rest of the group, expert $i$ updates his own opinion as follows:

$$\mathbf{f}_i^{(1)} = \sum_{j=1}^{n} p_{i,j} \mathbf{f}_j$$

where $p_{i,j}$ is the weight that expert $i$ assigns to the opinion
of expert $j$ when he carries out this update. Weights must be
chosen on the basis of the relative importance that experts assign
to their peers' opinions. It is assumed that $p_{i,j} \geq 0$, for
every expert $i$ and $j$, and $\sum_{j=1}^{n} p_{i,j} = 1$. In this way, each
updated opinion takes the form of a linear opinion pool. The
whole updating process can be written in a slightly more general
form using matrix notation, i.e., $\mathbf{F}^{(1)} = \mathbf{P}\mathbf{F}^{(0)}$, where:



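$$\mathbf{P} = \begin{bmatrix} p_{1,1} & p_{1,2} & \cdots & p_{1,n} \\ p_{2,1} & p_{2,2} & \cdots & p_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ p_{n,1} & p_{n,2} & \cdots & p_{n,n} \end{bmatrix}, \quad \text{and} \quad \mathbf{F}^{(0)} = \begin{bmatrix} \mathbf{f}_1 \\ \mathbf{f}_2 \\ \vdots \\ \mathbf{f}_n \end{bmatrix} = \begin{bmatrix} f_{1,1} & f_{1,2} & \cdots & f_{1,z} \\ f_{2,1} & f_{2,2} & \cdots & f_{2,z} \\ \vdots & \vdots & \ddots & \vdots \\ f_{n,1} & f_{n,2} & \cdots & f_{n,z} \end{bmatrix}$$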



Since all the opinions have changed, the experts might wish 
to revise their new opinions in the same way as they did 
before. If there is no basis for the experts to change their 
weights, we can then represent the whole updating process 
after $t$ revisions, for $t \geq 1$, as follows:

$$\mathbf{F}^{(t)} = \mathbf{P}\mathbf{F}^{(t-1)} = \mathbf{P}^{t}\mathbf{F}^{(0)} \tag{2}$$

Let $\mathbf{f}_i^{(t)} = (f_{i,1}^{(t)}, \ldots, f_{i,z}^{(t)})$ be expert $i$'s opinion after $t$
updates, i.e., it denotes the $i$th row of the matrix $\mathbf{F}^{(t)}$. We
say that a consensus is reached if $\mathbf{f}_i^{(t)} = \mathbf{f}_j^{(t)}$, for every expert
$i$ and $j$, as $t \to \infty$. Since $\mathbf{P}$, the matrix with weights, is an
$n \times n$ stochastic matrix, it can then be regarded as the one-step
transition probability matrix of a Markov chain with $n$ states
and stationary probabilities. Consequently, one can apply a
limit theorem that says that a consensus is reached when there
exists a positive integer $t$ such that every element in at least
one column of the matrix $\mathbf{P}^{t}$ is positive [DeGroot, 1974].
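
As a minimal sketch of the fixed-weight update in equation (2), the following Python snippet repeatedly multiplies an opinion matrix by a row-stochastic weight matrix. The weight matrix `P`, the number of steps, and the helper names `mat_mul` and `degroot` are illustrative assumptions and are not taken from DeGroot [1974].

```python
def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def degroot(P, F0, steps):
    """Apply F(t) = P F(t-1) for a fixed number of steps."""
    F = [row[:] for row in F0]
    for _ in range(steps):
        F = mat_mul(P, F)
    return F


# Hypothetical weights: every row is non-negative and sums to one.
P = [[0.6, 0.3, 0.1],
     [0.2, 0.5, 0.3],
     [0.1, 0.4, 0.5]]
# Initial opinions over z = 2 outcomes, one row per expert.
F0 = [[0.9, 0.1],
      [0.05, 0.95],
      [0.2, 0.8]]

F = degroot(P, F0, steps=200)
# Largest gap between any two experts' probabilities for the same outcome.
spread = max(abs(F[i][k] - F[j][k])
             for i in range(3) for j in range(3) for k in range(2))
print(F[0], spread)
```

Because every entry of this assumed `P` is positive, the column condition of the limit theorem already holds for t = 1, so the rows of `F` should agree up to floating-point error.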

3.2 Weights as a Measure of Distance 

The original method proposed by DeGroot [1974] has some
drawbacks. First, the experts might want to change the 
weights that they assign to their peers' opinions after learn- 
ing their initial opinions or after observing how much the 
opinions have changed from stage to stage. Further, opin- 
ions and/or identities have to be disclosed to the whole group 
when the experts are assigning the weights. Hence, privacy 
is not preserved, a fact which might be troublesome when the 
underlying event is of a sensitive nature. 

In order to tackle these problems, we derive the weights 
that experts assign to the reported opinions by interpreting 
each weight as a measure of distance. We start by making the 
assumption that experts prefer opinions that are close to their
own opinions, where closeness is measured by the following
distance function:

$$D(\mathbf{f}_i, \mathbf{f}_j) = \sqrt{\frac{\sum_{k=1}^{z} (f_{i,k} - f_{j,k})^2}{z}} \tag{3}$$

i.e., it is the root-mean-square deviation between two opinions
$\mathbf{f}_i$ and $\mathbf{f}_j$. Given the above assumption, one can estimate
the weight that expert $i$ assigns to expert $j$'s opinion at a given
time $t$, for $t \geq 1$, as follows:

$$p_{i,j}^{(t)} = \frac{\alpha_i^{(t)}}{\epsilon + D\left(\mathbf{f}_i^{(t-1)}, \mathbf{f}_j^{(t-1)}\right)} \tag{4}$$

where $\alpha_i^{(t)}$ normalizes the weights so that they sum to one,
and $\epsilon$ is a small, positive constant used to avoid division by
zero. We set $\mathbf{f}_i^{(0)} = \mathbf{f}_i$, i.e., it is the original opinion reported
by expert $i$. There are some important points regarding equation (4).
First, the distance between two opinions is always
non-negative. Hence, the constant $\epsilon$ ensures that every single
weight is strictly greater than 0 and strictly less than 1. Further,
the closer the opinions $\mathbf{f}_i^{(t-1)}$ and $\mathbf{f}_j^{(t-1)}$ are, the higher
the resulting weight $p_{i,j}^{(t)}$ will be. Since $D\left(\mathbf{f}_i^{(t-1)}, \mathbf{f}_i^{(t-1)}\right) = 0$,
the weight that each expert assigns to his own opinion is
always greater than or equal to the weights that he assigns to
his peers' opinions.
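
The distance function (3) and the weights (4) can be sketched in a few lines of Python. This is only an illustration: the function names `rmsd` and `weight_row`, the three example opinions, and the value of `eps` are assumptions chosen for the example.

```python
import math


def rmsd(f, g):
    """Root-mean-square deviation between two opinions, as in equation (3)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f, g)) / len(f))


def weight_row(i, opinions, eps):
    """Weights that expert i assigns to every expert, as in equation (4)."""
    raw = [1.0 / (eps + rmsd(opinions[i], g)) for g in opinions]
    alpha = 1.0 / sum(raw)  # normalizing constant alpha_i
    return [alpha * r for r in raw]


# Illustrative opinions (assumed values) over z = 3 outcomes.
opinions = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.1, 0.8]]
row = weight_row(0, opinions, eps=0.01)
print(row)
```

In this example the first expert's largest weight goes to his own opinion, and the nearby second opinion receives more weight than the distant third one, which is the behavior described above.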

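Now, we can redefine equation (2) so as to allow the experts
to update their weights based on the most recent opinions.
After $t$ revisions, for $t \geq 1$, we then have that
$\mathbf{F}^{(t)} = \mathbf{P}^{(t)}\mathbf{F}^{(t-1)} = \mathbf{P}^{(t)}\mathbf{P}^{(t-1)} \cdots \mathbf{P}^{(1)}\mathbf{F}^{(0)}$, where each element
of each matrix $\mathbf{P}^{(k)}$ is computed according to equation (4):

$$\mathbf{P}^{(k)} = \begin{bmatrix} p_{1,1}^{(k)} & p_{1,2}^{(k)} & \cdots & p_{1,n}^{(k)} \\ p_{2,1}^{(k)} & p_{2,2}^{(k)} & \cdots & p_{2,n}^{(k)} \\ \vdots & \vdots & \ddots & \vdots \\ p_{n,1}^{(k)} & p_{n,2}^{(k)} & \cdots & p_{n,n}^{(k)} \end{bmatrix}$$

The opinion of each expert $i$ at time $t$ then becomes
$\mathbf{f}_i^{(t)} = \sum_{j=1}^{n} p_{i,j}^{(t)} \mathbf{f}_j^{(t-1)}$. Algorithm 1 provides an algorithmic
description of the proposed method.

Algorithm 1 Algorithmic description of the proposed method
to find a consensual opinion.

Require: $n$ probability vectors $\mathbf{f}_1^{(0)}, \ldots, \mathbf{f}_n^{(0)}$.
Require: recalibration factor $\epsilon$.

for $t = 1$ to $\infty$ do
    for $i = 1$ to $n$ do
        for $j = 1$ to $n$ do
            $p_{i,j}^{(t)} = \alpha_i^{(t)} / \left(\epsilon + D\left(\mathbf{f}_i^{(t-1)}, \mathbf{f}_j^{(t-1)}\right)\right)$
        end for
        $\mathbf{f}_i^{(t)} = \sum_{j=1}^{n} p_{i,j}^{(t)} \mathbf{f}_j^{(t-1)}$
    end for
end for

In order to prove that all opinions converge towards a consensual
opinion when using the proposed method, consider
the following functions:

$$\delta(\mathbf{U}) = \frac{1}{2} \max_{i,j} \sum_{k=1}^{z} |u_{i,k} - u_{j,k}|$$

$$\gamma(\mathbf{U}) = \min_{i,j} \sum_{k=1}^{z} \min(u_{i,k}, u_{j,k})$$

where $0 \leq \delta(\mathbf{U}), \gamma(\mathbf{U}) \leq 1$, and $\mathbf{U}$ is a stochastic matrix.
$\delta(\mathbf{U})$ computes the maximum absolute difference between
two rows of a stochastic matrix $\mathbf{U}$. Thus, when $\delta\left(\mathbf{F}^{(t)}\right) = 0$,
all rows of $\mathbf{F}^{(t)}$ are the same, i.e., a consensus is reached. We
use the following results in our proof [Paz, 1971]:

Proposition 1. Given two stochastic matrices $\mathbf{U}$ and $\mathbf{V}$,
$\delta(\mathbf{U}\mathbf{V}) \leq \delta(\mathbf{U})\delta(\mathbf{V})$.

Proposition 2. Given a stochastic matrix $\mathbf{U}$, then $\delta(\mathbf{U}) = 1 - \gamma(\mathbf{U})$.

Our main result is stated below.

Theorem 1. When $t \to \infty$, $\mathbf{f}_i^{(t)} = \mathbf{f}_j^{(t)}$, for every expert $i$ and $j$.

Proof. Recall that $\mathbf{F}^{(t)}$ is the stochastic matrix representing
the experts' opinions after $t$ revisions, and that
$\mathbf{F}^{(t)} = \mathbf{P}^{(t)}\mathbf{F}^{(t-1)}$. Now, consider the following sequence:
$\left(\delta\left(\mathbf{F}^{(0)}\right), \delta\left(\mathbf{F}^{(1)}\right), \ldots, \delta\left(\mathbf{F}^{(t)}\right)\right)$. We are interested in the
behavior of this sequence when $t \to \infty$. First, we show that
such a sequence is monotonically decreasing:

$$\begin{aligned} \delta\left(\mathbf{F}^{(t)}\right) &= \delta\left(\mathbf{P}^{(t)}\mathbf{F}^{(t-1)}\right) \\ &\leq \delta\left(\mathbf{P}^{(t)}\right)\delta\left(\mathbf{F}^{(t-1)}\right) \\ &= \left(1 - \gamma\left(\mathbf{P}^{(t)}\right)\right)\delta\left(\mathbf{F}^{(t-1)}\right) \end{aligned}$$

The second and third lines follow, respectively, from
Propositions 1 and 2. Since $\delta(\mathbf{U}) \geq 0$ for every stochastic
matrix $\mathbf{U}$, then the above mentioned sequence is a bounded
decreasing sequence. Moreover, because $D \leq 1$ for probability
vectors, equation (4) gives $p_{i,j}^{(t)} \geq \epsilon / (n(1 + \epsilon))$, so
$\gamma\left(\mathbf{P}^{(t)}\right) \geq \epsilon / (1 + \epsilon) > 0$ and the sequence shrinks by at least
the factor $1/(1 + \epsilon)$ at every step. Hence, we can apply the standard
monotone convergence theorem [Bartle and Sherbert, 2000]
and $\delta\left(\mathbf{F}^{(\infty)}\right) = 0$. Consequently, all rows of the stochastic
matrix $\mathbf{F}^{(\infty)}$ are the same. □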
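
The two coefficients used in the proof are easy to compute, and Propositions 1 and 2 can be spot-checked numerically. The sketch below does so on randomly generated stochastic matrices; the helper names, the matrix sizes, and the random seed are assumptions made for illustration, and a numerical check of this kind is of course not a proof.

```python
import random


def delta(U):
    """Half the largest L1 distance between two rows of U."""
    return 0.5 * max(sum(abs(a - b) for a, b in zip(r, s))
                     for r in U for s in U)


def gamma(U):
    """Smallest overlap, sum_k min(u_ik, u_jk), between two rows of U."""
    return min(sum(min(a, b) for a, b in zip(r, s)) for r in U for s in U)


def random_stochastic(n, m, rng):
    """Random n-by-m matrix whose rows are non-negative and sum to one."""
    rows = [[rng.random() for _ in range(m)] for _ in range(n)]
    return [[x / sum(r) for x in r] for r in rows]


def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


rng = random.Random(0)  # fixed seed so the check is repeatable
U = random_stochastic(4, 4, rng)
V = random_stochastic(4, 3, rng)
# Proposition 1: delta(UV) <= delta(U) * delta(V), up to rounding error.
print(delta(mat_mul(U, V)) <= delta(U) * delta(V) + 1e-12)
# Proposition 2: delta(U) = 1 - gamma(U), up to rounding error.
print(abs(delta(U) - (1.0 - gamma(U))) < 1e-12)
```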

In other words, a consensus is always reached under the 
proposed method, and this does not depend on the initial re- 
ported opinions. A straightforward corollary of Theorem 1 is 
that all revised weights converge to the same value. 

Corollary 1. When $t \to \infty$, $p_{i,j}^{(t)} = \frac{1}{n}$, for every expert $i$ and $j$.



Hence, the proposed method works as if experts were 
continuously exchanging information so that their individual 
knowledge becomes group knowledge and all opinions are 
equally weighted. Since we derive weights from the reported 
opinions, we are then able to avoid some problems that might 
arise when eliciting these weights directly, e.g., opinions do 
not need to be disclosed to others in order for them to assign 
weights, thus preserving privacy. 

The resulting consensual opinion can be represented as
an instance of the linear opinion pool. Recall that
$\mathbf{f}_i^{(t)} = \sum_{j=1}^{n} p_{i,j}^{(t)} \mathbf{f}_j^{(t-1)} = \sum_{j=1}^{n} p_{i,j}^{(t)} \sum_{k=1}^{n} p_{j,k}^{(t-1)} \mathbf{f}_k^{(t-2)} = \cdots = \sum_{j=1}^{n} \beta_j \mathbf{f}_j^{(0)}$,
where $\boldsymbol{\beta} = (\beta_1, \beta_2, \ldots, \beta_n)$ is a probability
vector that incorporates all the previous weights. Hence, another
interpretation of the proposed method is that experts
reach a consensus regarding the weights in equation (1).

3.3 Numerical Example 

A numerical example may clarify the mechanics of the proposed
method. Consider three experts ($n = 3$) with the following
opinions: $\mathbf{f}_1 = (0.9, 0.1)$, $\mathbf{f}_2 = (0.05, 0.95)$, and
$\mathbf{f}_3 = (0.2, 0.8)$. According to (3), the initial distance between,
say, $\mathbf{f}_1$ and $\mathbf{f}_2$ is:

$$D(\mathbf{f}_1, \mathbf{f}_2) = \sqrt{\frac{(0.9 - 0.05)^2 + (0.1 - 0.95)^2}{2}} = 0.85$$

Similarly, we have that $D(\mathbf{f}_1, \mathbf{f}_1) = 0$ and $D(\mathbf{f}_1, \mathbf{f}_3) = 0.7$.
Using equation (4), we can then derive the weights that
each expert assigns to the reported opinions. Focusing on
expert 1 at time $t = 1$ and setting $\epsilon = 0.01$, we obtain that
$p_{1,1}^{(1)} = \alpha_1^{(1)}/0.01$, $p_{1,2}^{(1)} = \alpha_1^{(1)}/0.86$, and $p_{1,3}^{(1)} = \alpha_1^{(1)}/0.71$.
Since these weights must sum to one, we have that $\alpha_1^{(1)} \approx 0.00975$
and, consequently, $p_{1,1}^{(1)} \approx 0.975$, $p_{1,2}^{(1)} \approx 0.011$, and
$p_{1,3}^{(1)} \approx 0.014$. Repeating the same procedure for all experts,
we obtain the matrix:

$$\mathbf{P}^{(1)} = \begin{bmatrix} 0.975 & 0.011 & 0.014 \\ 0.011 & 0.931 & 0.058 \\ 0.013 & 0.058 & 0.929 \end{bmatrix}$$

The updated opinion of expert 1 is then
$\mathbf{f}_1^{(1)} = \sum_{j=1}^{3} p_{1,j}^{(1)} \mathbf{f}_j^{(0)} \approx (0.8809, 0.1191)$. By repeating the above
procedure, when $t \to \infty$, $\mathbf{P}^{(t)}$ converges to a matrix where
all the elements are equal to 1/3. Moreover, all experts' opinions
converge to the prediction (0.3175, 0.6825). An interesting
point to note is that the resulting prediction would be
(0.3833, 0.6167) if we had taken the average of the reported
opinions, i.e., expert 1, who has a very different opinion,
would have more influence on the aggregate prediction.
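
The example can be replayed with a short Python sketch of the iterative procedure. The function names, the stopping tolerance `tol`, and the iteration cap `max_iter` are assumptions introduced here, since the method itself is stated as an unbounded loop; the opinions and the value of `eps` are those of the example.

```python
import math


def rmsd(f, g):
    """Root-mean-square deviation, as in equation (3)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f, g)) / len(f))


def weight_matrix(F, eps):
    """Row i holds the weights of equation (4) from expert i's viewpoint."""
    P = []
    for f in F:
        raw = [1.0 / (eps + rmsd(f, g)) for g in F]
        total = sum(raw)
        P.append([r / total for r in raw])
    return P


def update(P, F):
    """One revision: each new opinion is a linear pool of the old ones."""
    z = len(F[0])
    return [[sum(p * g[k] for p, g in zip(row, F)) for k in range(z)]
            for row in P]


def consensus(F0, eps, tol=1e-12, max_iter=10000):
    """Iterate until all opinions agree to within tol (an assumed stop rule)."""
    F = [row[:] for row in F0]
    for _ in range(max_iter):
        F = update(weight_matrix(F, eps), F)
        if max(rmsd(f, g) for f in F for g in F) < tol:
            break
    return F


F0 = [[0.9, 0.1], [0.05, 0.95], [0.2, 0.8]]
P1 = weight_matrix(F0, eps=0.01)
print([round(p, 3) for p in P1[0]])  # [0.975, 0.011, 0.014]
print(consensus(F0, eps=0.01)[0])
```

The first printed row reproduces the weights of expert 1 given in the text; the second line prints the common opinion the sketch converges to, which can be compared with the consensual prediction reported above.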

4 Consensus and Proper Scoring Rules 

The major assumption of the proposed method is that experts 
prefer opinions that are close to their own opinions. In this 
section, we formally investigate the validity of this assumption.
We start by noting that in the absence of a well-chosen
incentive structure, the experts might indulge in game playing
which distorts their reported opinions. For example, experts
who have a reputation to protect might tend to produce forecasts
near the most likely group consensus, whereas experts
who have a reputation to build might tend to overstate the
probabilities of outcomes they feel will be understated in a
possible consensus [Friedman, 1983].

Scoring rules are traditional devices used to promote honesty
in forecasting settings [Savage, 1971]. Formally, a scoring
rule is a real-valued function, $R(\mathbf{f}_i, e)$, that provides a
score for the opinion $\mathbf{f}_i$ upon observing the outcome $\theta_e$.

Assuming that experts' utility functions are linear with respect
to the range of the score used in conjunction with the
scoring rule, the condition that $R$ is strictly proper implies
that the opinion reported by each expert strictly maximizes
his expected utility if and only if he is honest. Formally,
$\arg\max_{\mathbf{f}_i'} E_{\mathbf{f}_i}[R(\mathbf{f}_i')] = \mathbf{f}_i$, where $E_{\mathbf{f}_i}[R(\cdot)]$ is the $\mathbf{f}_i$-expected
value of $R$, i.e., $E_{\mathbf{f}_i}[R(\mathbf{f}_i')] = \sum_{e=1}^{z} f_{i,e} R(\mathbf{f}_i', e)$. A well-known
strictly proper scoring rule is the quadratic scoring
rule:

$$R(\mathbf{f}_i, e) = 2 f_{i,e} - \sum_{k=1}^{z} f_{i,k}^2 \tag{5}$$

The scoring range of the quadratic scoring rule is $[-1, 1]$.
The proof that the quadratic scoring rule is indeed strictly
proper as well as some of its interesting properties can be
seen in the work by Selten [1998].
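
To make strict properness concrete, the following Python sketch evaluates the quadratic scoring rule (5) and searches a grid of binary reports for the one that maximizes the expected score. The belief (0.7, 0.3), the grid resolution, and the function names are assumptions chosen for illustration.

```python
def quadratic_score(f, e):
    """Quadratic scoring rule of equation (5); e is a zero-based outcome index."""
    return 2.0 * f[e] - sum(p * p for p in f)


def expected_score(belief, report):
    """Expected score of `report` under the expert's true `belief`."""
    return sum(b * quadratic_score(report, e) for e, b in enumerate(belief))


belief = [0.7, 0.3]  # assumed true opinion
# Candidate reports on a grid with step 0.01.
reports = [[r / 100.0, 1.0 - r / 100.0] for r in range(101)]
best = max(reports, key=lambda rep: expected_score(belief, rep))
print(best)
```

On this grid the maximizing report coincides with the assumed belief, and the extreme scores `quadratic_score([1.0, 0.0], 0)` and `quadratic_score([1.0, 0.0], 1)` give the two ends of the scoring range.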

Proper scoring rules have been used as a tool to promote
truthfulness in a variety of domains, e.g., when sharing
rewards among a set of agents based on peer evaluations
[Carvalho and Larson, 2010; Carvalho and Larson, 2011;
Carvalho and Larson, 2012], when incentivizing agents to
accurately estimate their own efforts to accomplish a task
[Bacon et al., 2012], in financial markets set to aggregate
agents' private information [Hanson, 2003; Hanson, 2007], in weather
forecasting [Gneiting and Raftery, 2007], etc.



4.1 Effective Scoring Rules 

Scoring rules can also be classified based on monotonicity
properties. Consider a metric $G$ that assigns to any pair of
opinions $\mathbf{f}_i$ and $\mathbf{f}_j$ a real number, which in turn can be seen as
the shortest distance between $\mathbf{f}_i$ and $\mathbf{f}_j$. We say that a scoring
rule $R$ is effective with respect to $G$ if the following relation
holds for any opinions $\mathbf{f}_i$, $\mathbf{f}_j$, and $\mathbf{f}_k$ [Friedman, 1983]:

$$G(\mathbf{f}_i, \mathbf{f}_j) < G(\mathbf{f}_i, \mathbf{f}_k) \iff E_{\mathbf{f}_i}[R(\mathbf{f}_j)] > E_{\mathbf{f}_i}[R(\mathbf{f}_k)]$$

In words, each expert's expected score can be seen as a
monotone decreasing function of the distance between his
true opinion and the reported one, i.e., experts still strictly
maximize their expected scores by telling the truth, and the
closer a reported opinion is to the true opinion, the higher
the expected score will be. The property of effectiveness is
stronger than strict properness, and it has been proposed as a
desideratum for scoring rules for reasons of monotonicity in
keeping an expert close to his true opinion [Friedman, 1983].

By definition, a metric $G$ must satisfy the following conditions
for any opinions $\mathbf{f}_i$, $\mathbf{f}_j$, and $\mathbf{f}_k$:

1. Positivity: $G(\mathbf{f}_i, \mathbf{f}_j) \geq 0$, for all experts $i$ and $j$, and
$G(\mathbf{f}_i, \mathbf{f}_j) = 0$ if and only if $\mathbf{f}_i = \mathbf{f}_j$;

2. Symmetry: $G(\mathbf{f}_i, \mathbf{f}_j) = G(\mathbf{f}_j, \mathbf{f}_i)$;

3. Triangle Inequality: $G(\mathbf{f}_i, \mathbf{f}_k) \leq G(\mathbf{f}_i, \mathbf{f}_j) + G(\mathbf{f}_j, \mathbf{f}_k)$.

The root-mean-square deviation shown in (3) satisfies the
above conditions. However, equation (4), taken as a function
of opinions, is not a true metric, e.g., $p_{i,i}^{(t)} > 0$ and symmetry
does not always hold. We adjust the original definition
of effective scoring rules so as to consider weights instead of
metrics. We say that a scoring rule $R$ is effective with respect
to a set of weights $W = \{p_{1,1}^{(t)}, \ldots, p_{1,n}^{(t)}, p_{2,1}^{(t)}, \ldots, p_{n,n}^{(t)}\}$ assigned
at any time $t \geq 1$ if the following relation holds for
any opinions $\mathbf{f}_i^{(t-1)}$, $\mathbf{f}_j^{(t-1)}$, and $\mathbf{f}_k^{(t-1)}$:

$$p_{i,j}^{(t)} > p_{i,k}^{(t)} \iff E_{\mathbf{f}_i^{(t-1)}}\left[R\left(\mathbf{f}_j^{(t-1)}\right)\right] > E_{\mathbf{f}_i^{(t-1)}}\left[R\left(\mathbf{f}_k^{(t-1)}\right)\right]$$

In words, each expert's expected score can be seen as a 
monotone increasing function of his assigned weights, i.e., 
the higher the weight one expert assigns to a peer's opinion, 
the greater the expected score of that expert would be if he 
reported his peer's opinion, and vice versa. We prove below 
that the quadratic scoring rule shown in (5) is effective with
respect to a set of weights assigned according to (4).

Proposition 3. The quadratic scoring rule shown in (5)
is effective with respect to a set of weights
$W = \{p_{1,1}^{(t)}, \ldots, p_{1,n}^{(t)}, p_{2,1}^{(t)}, \ldots, p_{n,n}^{(t)}\}$ assigned at any time $t \geq 1$
according to equation (4).

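Proof. Given an opinion $\mathbf{f}_j$, we note that the $\mathbf{f}_i$-expected
value of the quadratic scoring rule in (5) can be written as:

$$E_{\mathbf{f}_i}[R(\mathbf{f}_j)] = \sum_{e=1}^{z} f_{i,e} R(\mathbf{f}_j, e) = \sum_{e=1}^{z} 2 f_{j,e} f_{i,e} - \sum_{e=1}^{z} f_{i,e} \sum_{x=1}^{z} f_{j,x}^2 = 2 \sum_{e=1}^{z} f_{i,e} f_{j,e} - \sum_{x=1}^{z} f_{j,x}^2$$

where the last equality uses $\sum_{e=1}^{z} f_{i,e} = 1$. Now, consider the
weights assigned by expert $i$ to the opinions of experts $j$ and $k$
at time $t \geq 1$ according to equation (4). We have that
$p_{i,k}^{(t)} < p_{i,j}^{(t)}$ if and only if:

$$\begin{aligned} \frac{\alpha_i^{(t)}}{\epsilon + D\left(\mathbf{f}_i^{(t-1)}, \mathbf{f}_k^{(t-1)}\right)} &< \frac{\alpha_i^{(t)}}{\epsilon + D\left(\mathbf{f}_i^{(t-1)}, \mathbf{f}_j^{(t-1)}\right)} \\ \iff D\left(\mathbf{f}_i^{(t-1)}, \mathbf{f}_j^{(t-1)}\right) &< D\left(\mathbf{f}_i^{(t-1)}, \mathbf{f}_k^{(t-1)}\right) \\ \iff \sum_{x=1}^{z} \left(f_{i,x}^{(t-1)} - f_{j,x}^{(t-1)}\right)^2 &< \sum_{x=1}^{z} \left(f_{i,x}^{(t-1)} - f_{k,x}^{(t-1)}\right)^2 \\ \iff 2 \sum_{x=1}^{z} f_{i,x}^{(t-1)} f_{j,x}^{(t-1)} - \sum_{x=1}^{z} \left(f_{j,x}^{(t-1)}\right)^2 &> 2 \sum_{x=1}^{z} f_{i,x}^{(t-1)} f_{k,x}^{(t-1)} - \sum_{x=1}^{z} \left(f_{k,x}^{(t-1)}\right)^2 \\ \iff E_{\mathbf{f}_i^{(t-1)}}\left[R\left(\mathbf{f}_j^{(t-1)}\right)\right] &> E_{\mathbf{f}_i^{(t-1)}}\left[R\left(\mathbf{f}_k^{(t-1)}\right)\right] \end{aligned}$$

□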



Proposition 3 implies that there is a correspondence between
weights, assigned according to (4), and expected scores
from the quadratic scoring rule: the higher the weight one ex- 
pert assigns to a peer's opinion, the greater that expert's ex- 
pected score would be if he reported his peer's opinion, and 
vice versa. Hence, whenever experts are rational, i.e., when 
they behave so as to maximize their expected scores, and their 
opinions are rewarded using the quadratic scoring rule, then 
the major assumption of the proposed method for finding a 
consensual opinion, namely that experts prefer opinions that 
are close to their own opinions, is formally valid. A straightforward
corollary of Proposition 3 is that a positive affine
transformation of the quadratic scoring rule is still effective
with respect to a set of weights assigned according to (4).

Corollary 2. A positive affine transformation of the
quadratic scoring rule $R$ in (5), i.e., $xR(\mathbf{f}_i, e) + y$, for $x > 0$
and $y \in \mathbb{R}$, is effective with respect to a set of weights
$W = \{p_{1,1}^{(t)}, \ldots, p_{1,n}^{(t)}, p_{2,1}^{(t)}, \ldots, p_{n,n}^{(t)}\}$ assigned at any time
$t \geq 1$ according to equation (4).
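
The correspondence between weights and expected scores can be spot-checked numerically. The sketch below draws random triples of opinions and compares the ordering of the weights from equation (4) with the ordering of the expected quadratic scores, both for the rule itself and for one positive affine transformation. The sample size, the random seed, the value of `eps`, and the transformation 200 R - 100 are assumptions for illustration; such a check complements the proof but does not replace it.

```python
import math
import random


def rmsd(f, g):
    """Root-mean-square deviation, as in equation (3)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f, g)) / len(f))


def expected_quadratic(belief, report):
    """E_belief[R(report)] = 2 * sum_e belief_e * report_e - sum_x report_x^2."""
    return (2.0 * sum(b * r for b, r in zip(belief, report))
            - sum(r * r for r in report))


def random_opinion(z, rng):
    """Random probability vector with z strictly positive entries."""
    raw = [rng.random() + 1e-9 for _ in range(z)]
    total = sum(raw)
    return [r / total for r in raw]


rng = random.Random(1)  # fixed seed so the check is repeatable
eps = 0.01
agree = True
for _ in range(1000):
    fi, fj, fk = (random_opinion(3, rng) for _ in range(3))
    # Unnormalized weights suffice: alpha_i is shared by both comparisons.
    w_j = 1.0 / (eps + rmsd(fi, fj))
    w_k = 1.0 / (eps + rmsd(fi, fk))
    s_j = expected_quadratic(fi, fj)
    s_k = expected_quadratic(fi, fk)
    same_order = (w_j > w_k) == (s_j > s_k)
    # A positive affine transformation keeps the ordering of expected scores.
    same_affine = (w_j > w_k) == (200.0 * s_j - 100.0 > 200.0 * s_k - 100.0)
    agree = agree and same_order and same_affine
print(agree)
```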



5 Empirical Evaluation 

In this section, we describe an experiment designed to test 
the efficacy of the proposed method for finding a consensual 
opinion. In the following subsections, we describe the dataset 
used in our experiments, the metrics used to compare differ- 
ent methods to aggregate opinions, and the obtained results. 

5.1 Dataset 

Our dataset was composed of 267 games (256 regular-season
games and 11 playoff games) from the National Football
League (NFL) held between September 8th, 2005 and February
5th, 2006. We obtained the opinions of 519 experts for
the NFL games from the ProbabilityFootball contest (available
at http://probabilityfootball.com/2005/). The
contest was free to enter. Each expert was asked to report his
subjective probability that a team would win a game. Predictions
had to be reported by noon on the day of the game.
Since the probability of a tie in NFL games is very low (less
than 1%), experts did not report the probability of such an
outcome. In particular, no ties occurred in our dataset.

Not all 519 registered experts reported their predictions for
every game. An expert who did not enter a prediction for a
game was removed from the opinion pool for that game. On
average, each game attracted approximately 432 experts, the
standard deviation being equal to 26.37. The minimum and
maximum number of experts were, respectively, 243 and 462.
Importantly, the contest rewarded the performance of experts
via a positive affine transformation of the quadratic scoring
rule, i.e., $100 - 400 \times p_l^2$, where $p_l$ was the probability that
an expert assigned to the eventual losing team.

A positive affine transformation of a strictly proper scoring
rule is still strictly proper [Gneiting and Raftery, 2007].
The above scoring rule can be obtained by multiplying (5) by
200 and subtracting 100 from the result. The resulting proper
scoring rule rewards bold predictions more when they are
right. Likewise, it penalizes bold predictions more when they
are wrong. For example, a prediction of 99% earns 99.96
points if the chosen team wins, and it loses 292.04 points if
the chosen team loses. On the other hand, a prediction of
51% earns 3.96 points if it is correct, and it loses 4.04 points
if it is wrong. A prediction of 50% neither gains nor loses
any points. The experts with highest accumulated scores won
prizes in the contest. The suggested strategy at the contest
website was "to make picks for each game that match, as
closely as possible, the probabilities that each team will win".
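
The point values quoted above can be reproduced with a few lines of Python. The function names are assumptions introduced for this sketch; the formula is the contest rule stated in the text, and the last line checks it against the affine transformation of the quadratic scoring rule.

```python
def contest_score(p_loser):
    """Contest reward, 100 - 400 * p_l^2, with p_l the losing team's probability."""
    return 100.0 - 400.0 * p_loser ** 2


def quadratic_score(f, e):
    """Quadratic scoring rule; e is a zero-based outcome index."""
    return 2.0 * f[e] - sum(p * p for p in f)


for p_win in (0.99, 0.51, 0.50):
    right = contest_score(1.0 - p_win)  # the chosen team wins
    wrong = contest_score(p_win)        # the chosen team loses
    print(p_win, round(right, 2), round(wrong, 2))

# Same value via the affine transformation 200 * R - 100 of the quadratic rule.
print(round(200.0 * quadratic_score([0.99, 0.01], 0) - 100.0, 2))
```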

We argue that this dataset is very suitable for our purposes
for several reasons. First, the popularity of NFL games provides
natural incentives for people to participate in the
ProbabilityFootball contest. Furthermore, the intense media coverage
and scrutiny of the strengths and weaknesses of the teams
and individual players provide useful information for the general
public. Hence, participants of the contest can be viewed
as knowledgeable regarding the forecasting goal. Finally,
the fact that experts were rewarded via a positive affine trans- 
formation of the quadratic scoring rule fits perfectly into the 
theory developed in this work (see Corollary 2). 

5.2 Metrics 

We used two different metrics to assess the prediction power 
of different aggregation methods. 

Overall Accuracy 

We say that a team is the predicted favorite for winning a
game when the aggregate probability that this team will win
the game is greater than 0.5. Overall accuracy is then the percentage
of games that predicted favorites have indeed won. A
pooling method with higher overall accuracy is more accurate.

Absolute Error 

Absolute error is the difference between a perfect prediction 
(1 for the winning team) and the actual prediction. Thus, it is 
just the probability assigned to the losing team ($p_l$). An aggregate
prediction with lower absolute error is more accurate.
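
Both metrics are simple to compute. In the Python sketch below, each game is encoded by the aggregate probability that one designated team wins and a flag saying whether that team actually won. This encoding, the treatment of a probability of exactly 0.5 as a miss, and the four example games are assumptions made for illustration and do not come from the dataset.

```python
def overall_accuracy(predictions, outcomes):
    """Share of games in which the predicted favorite actually won.

    A probability of exactly 0.5 names no favorite and is counted as a
    miss here, which is an assumption of this sketch.
    """
    hits = 0
    for p, won in zip(predictions, outcomes):
        if (p > 0.5 and won) or (p < 0.5 and not won):
            hits += 1
    return hits / len(predictions)


def absolute_error(p, won):
    """Probability that the prediction assigned to the losing side."""
    return 1.0 - p if won else p


# Hypothetical aggregate predictions and results for four games.
predictions = [0.8, 0.4, 0.55, 0.3]
outcomes = [True, True, False, False]

accuracy = overall_accuracy(predictions, outcomes)
errors = [absolute_error(p, won) for p, won in zip(predictions, outcomes)]
mean_error = sum(errors) / len(errors)
print(accuracy, mean_error)
```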

5.3 Experimental Results 

For each game in our dataset, we aggregated the reported
opinions using three different linear opinion pools: the
method proposed in Section 3, henceforth referred to as the
consensual method, with $\epsilon = 10^{-4}$; the traditional average
approach, where all the weights in (1) are equal to $1/n$; and
the method proposed by Barlow et al. [1986], henceforth referred
to as the BMS method. These authors proposed that the
weight assigned to expert $i$'s opinion should be $w_i = \frac{c}{I(\mathbf{f}_i, \mathbf{f}_i^{*})}$,
where $c$ is a normalizing constant, $I(\mathbf{f}_i, \mathbf{f}_i^{*})$ is the Kullback-Leibler
divergence, and $\mathbf{f}_i^{*}$ achieves $\max\{I(\mathbf{f}_i, \mathbf{f}_j) : 1 \leq j \leq n\}$,
i.e., $\mathbf{f}_i^{*}$ is the most distant opinion from expert $i$'s opinion.
The BMS method produces indeterminate outputs whenever
there are probability assessments equal to 0 or 1. Hence,
we recalibrated the reported opinions when using the BMS
method by replacing 0 and 1 by, respectively, 0.01 and 0.99.

Table 1: Average absolute error of each method over the 267
games. Standard deviations are in parentheses.

Method        Average absolute error
Consensual    0.4115 (0.1813)
Average       0.4176 (0.1684)
BMS           0.4295 (0.1438)
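
A rough Python sketch of the BMS weighting with the recalibration step is given below. It assumes binary opinions, takes the Kullback-Leibler divergence in the direction $I(\mathbf{f}, \mathbf{g}) = \sum_k f_k \log(f_k / g_k)$, and assumes that not all opinions are identical (otherwise the largest divergence is zero and the weight is undefined). The function names and the example opinions are hypothetical, and the sketch is not the implementation used in the experiment.

```python
import math


def kl(f, g):
    """Kullback-Leibler divergence I(f, g) = sum_k f_k * log(f_k / g_k)."""
    return sum(a * math.log(a / b) for a, b in zip(f, g))


def recalibrate(f, lo=0.01, hi=0.99):
    """Replace probabilities of exactly 0 or 1 by 0.01 and 0.99."""
    return [lo if p == 0.0 else hi if p == 1.0 else p for p in f]


def bms_pool(opinions):
    """Linear pool with weights inversely proportional to the largest
    divergence from each opinion to any other opinion."""
    ops = [recalibrate(f) for f in opinions]
    raw = [1.0 / max(kl(f, g) for g in ops) for f in ops]
    c = 1.0 / sum(raw)  # normalizing constant
    weights = [c * r for r in raw]
    z = len(ops[0])
    pooled = [sum(w * f[k] for w, f in zip(weights, ops)) for k in range(z)]
    return weights, pooled


# Hypothetical binary opinions, one of them degenerate.
opinions = [[0.9, 0.1], [0.05, 0.95], [0.2, 0.8], [1.0, 0.0]]
weights, pooled = bms_pool(opinions)
print(weights, pooled)
```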

Given the aggregated opinions, we calculated the perfor- 
mance of each method according to the accuracy metrics pre- 
viously described. Regarding the overall accuracy of each 
method, the consensual method achieves the best perfor- 
mance in this experiment with an overall accuracy of 69.29%. 
The BMS and average methods achieve an overall accuracy 
of, respectively, 68.54% and 67.42%. 

Table 1 shows the average absolute error of each method
over the 267 games. The consensual method achieves the
best performance with an average absolute error of 0.4115.
We performed left-tailed Wilcoxon signed-rank tests in order
to investigate the statistical significance of these results. The
resulting p-values are all extremely small ($< 10^{-4}$), showing
that the results are indeed statistically significant.

Despite displaying a decent overall accuracy, the BMS 
method has the worst performance according to the absolute 
error metric. A clear drawback with this method is that it 
only considers the distance to the most distant opinion when 
assigning a weight to an opinion. Since our experimental de- 
sign involves hundreds of experts, it is reasonable to expect at 
least one of them to have a very different and wrong opinion. 

The high number of experts should give an advantage to the 
average method since biases of individual judgment can offset
each other when opinions are diverse, thus making
the aggregate prediction more accurate. However, the aver- 
age method achieves the worst overall accuracy, and it per- 
forms statistically worse than the consensual method when 
measured under the absolute error metric. We believe this re- 
sult happens because the average method ends up overweight- 
ing extreme opinions when equally weighting all opinions. 

On the other hand, under the consensual method, experts 
put less weight on opinions far from their own opinions, 
which implies that this method is generally less influenced 
by extreme predictions as illustrated in Section 3.3. 

6 Conclusion 

We proposed a pooling method to aggregate expert opin- 
ions. Intuitively, the proposed method works as if the experts 
were continuously updating their opinions, where each up- 
dated opinion takes the form of a linear opinion pool, and the 
weight that each expert assigns to a peer's opinion is inversely 
related to the distance between their opinions. We proved that 
this updating process leads to a consensus. 

A different interpretation of the proposed method is that 
experts reach a consensus regarding the weights of a linear 
opinion pool. We showed that if rational experts are rewarded 



using the quadratic scoring rule, then our major assumption, 
namely that experts prefer opinions that are close to their own 
opinions, follows naturally. To the best of our knowledge, this 
is the first work linking the theory of proper scoring rules to 
the seminal consensus theory proposed by DeGroot [1974].

Using real-world data, we compared the performance of 
the proposed method with two other methods: the traditional
average approach and another distance-based aggregation
method proposed by Barlow et al. [1986]. The results of
our experiment show that the proposed method outperforms 
all the other methods when measured in terms of both overall 
accuracy and absolute error. 